Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines
Authors
Abstract
This paper targets the growing area of interactive data analytics engines. We present a system called Getafix that intelligently decides replication levels and replica placement for data segments, in a way that is responsive to the changing popularity of data accessed by incoming queries. We present an optimal solution to the static version of the problem, achieving minimality in both makespan and replication factor. Based on this intuition, we build the Getafix system to handle queries and segments arriving in real time. We integrated Getafix into Druid, a modern open-source interactive data analytics engine. We present experimental results using workloads from Yahoo!’s production Druid cluster. Compared to existing work, Getafix achieves comparable query latency (both average and tail), while using 1.45-2.15× less memory in a private cloud. In a public cloud, for a 100 TB hot dataset size, Getafix can cut costs by as much as $10 million annually with negligible performance impact.
Similar resources
DimmWitted: A Study of Main-Memory Statistical Analytics
We perform the first study of the tradeoff space of access methods and replication to support statistical analytics using first-order methods executed in the main memory of a Non-Uniform Memory Access (NUMA) machine. Statistical analytics systems differ from conventional SQL-analytics in the amount and types of memory incoherence that they can tolerate. Our goal is to understand tradeoffs in ac...
Graph Analytics on Relational Databases
Graph analytics has become increasingly popular in recent years. Conventionally, data is stored in relational databases that have been refined over decades, resulting in highly optimized data processing engines. However, the awkwardness of expressing iterative queries in SQL makes the relational query-processing model inadequate for graph analytics, leading to many alternative solutions. Our r...
Real-Time Analytics as the Killer Application for Processing-In-Memory
While Processing-In-Memory (PIM) has been widely researched for the last two decades, it was never truly adopted by the industry and remains mostly within the academic research realm. This is mainly because (1) in-memory compute engines were too slow, and (2) a real-world application that could really benefit from PIM was never identified. In recent years, the first argument became untenable, but...
Design and Implementation of a Real-Time Interactive Analytics System for Large Spatio-Temporal Data
In real-time interactive data analytics, the user expects to receive the results of each query within a short time period such as seconds. This is especially challenging when the data is big (e.g., on the scale of petabytes), and the analytics system runs on top of cloud infrastructure (e.g., thousands of interconnected commodity servers). We have been building such a system, called OceanRT, fo...
Getafix: Workload-aware Distributed Interactive Analytics
Distributed interactive analytics engines (Druid, Redshift, Pinot) need to achieve low query latency while using the least storage space. This paper presents a solution to the problem of replication of data blocks and routing of queries. Our techniques decide the replication level of individual data blocks (based on popularity, access counts), as well as output optimal placement patterns for su...